Football is a very exciting sport. To this day, it remains the most popular game on the planet. Sorry, not sorry about other games.
I want to review data collected since 1872 to understand how matches between countries have evolved up to now. So, we call on R and a few libraries to help us visualize the data:
library(tidyverse)
library(plotly)
library(lubridate)
The first step is to read the file. I downloaded this dataset from Kaggle on 2021-07-22.
results <- read.csv("results.csv", encoding = "UTF-8")
This dataset contains data about \(42k+\) football matches in the history of international encounters between national teams. So, let’s take a little taste of the data:
head(results)
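The later charts group matches by year, which relies on the `date` column, and `read.csv()` loads that column as text. Here is a minimal sketch of converting it explicitly with `lubridate`, assuming the dates are stored as `YYYY-MM-DD` strings. The values in `toy_dates` are illustrative and are not read from the file:

```r
library(lubridate)

# Toy dates in the assumed YYYY-MM-DD format (illustrative values)
toy_dates <- c("1872-11-30", "1930-07-13", "2014-07-08")

# ymd() parses the strings into Date objects; year() extracts the year
parsed <- ymd(toy_dates)
years <- year(parsed)
years
```

On the real data, the same idea would be `results$date <- ymd(results$date)` right after reading the file.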
One interesting thing is to look at the context of the matches. Some of them may not be relevant at all, but there are also World Cup matches, continental tournaments, and so on:
levels(as.factor(results$tournament)) -> tournaments
sample(tournaments,20)
## [1] "Amílcar Cabral Cup"
## [2] "Viva World Cup"
## [3] "British Championship"
## [4] "Gold Cup qualification"
## [5] "Dragon Cup"
## [6] "ELF Cup"
## [7] "Nordic Championship"
## [8] "AFF Championship qualification"
## [9] "Copa Lipton"
## [10] "AFF Championship"
## [11] "Kirin Cup"
## [12] "CONCACAF Nations League qualification"
## [13] "African Nations Championship"
## [14] "Copa del Pacífico"
## [15] "Windward Islands Tournament"
## [16] "Balkan Cup"
## [17] "AFC Challenge Cup"
## [18] "WAFF Championship"
## [19] "International Cup"
## [20] "Copa Roca"
Filtering to tournaments with more than 100 matches played in their history:
results %>%
group_by(tournament) %>%
summarise(count=n()) %>%
filter(count > 100) %>%
select(tournament) -> popularCups
results %>%
filter(tournament %in% popularCups$tournament) %>%
ggplot(aes(x=tournament, fill=tournament)) +
geom_bar() +
coord_flip() +
labs(title="Matches in tournaments") -> p
ggplotly(p)
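The same threshold filter can be sketched more compactly with `dplyr::count()`. The data frame `toy` below is a hypothetical stand-in for `results`, and the threshold of 2 replaces 100 only so that the toy data produces a visible result:

```r
library(dplyr)

# Hypothetical stand-in for `results` with only the tournament column
toy <- data.frame(
  tournament = c("Friendly", "Friendly", "Friendly",
                 "Copa Roca",
                 "FIFA World Cup", "FIFA World Cup", "FIFA World Cup")
)

# count() is shorthand for group_by() followed by summarise(n = n())
popular <- toy %>%
  count(tournament, name = "matches") %>%
  filter(matches > 2)

popular
```

Only the tournaments with more than two matches survive, so "Copa Roca" is dropped from `popular`.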
Now we need to process the data a little to assign points in a standard way, based on the outcome of every match:
| Points | Outcome |
|---|---|
| \(3\) | Victory |
| \(1\) | Tie |
| \(0\) | Defeat |
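In FIFA scoring, 2 points can be earned by winning a shootout after a tied match; however, I ignored that for the following analysis.
The code below adds these points to every match and then shows how the data looks now, for tournaments that contain "FIFA World Cup" in their name: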
results %>%
mutate(tied=ifelse(home_score == away_score,TRUE,FALSE)) %>%
mutate(home_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,3,0))) %>%
mutate(away_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,0,3))) -> results
results %>%
filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
head(worldCupResults)
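As a sanity check on the scoring rule, the sketch below applies it to three made-up matches using `dplyr::case_when()`, which is one alternative to nested `ifelse()` calls. The team names and scores are hypothetical:

```r
library(dplyr)

# Three made-up matches: a home win, a draw, and an away win
toy <- data.frame(
  home_team  = c("A", "B", "C"),
  away_team  = c("B", "C", "A"),
  home_score = c(2, 1, 0),
  away_score = c(0, 1, 3)
)

scored <- toy %>%
  mutate(
    # 3 points for a victory, 1 for a tie, 0 for a defeat
    home_points = case_when(
      home_score > away_score  ~ 3,
      home_score == away_score ~ 1,
      TRUE                     ~ 0
    ),
    away_points = case_when(
      away_score > home_score  ~ 3,
      away_score == home_score ~ 1,
      TRUE                     ~ 0
    )
  )

scored
```

Every match should hand out either 3 points in total (one winner) or 2 points in total (a tie), which is a quick way to spot a mistake in the rule.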
After this step, we also need to reshape the dataset so that each row describes one team in one match, which lets us measure the performance of each national team.
The same filter (tournaments that contain "FIFA World Cup" in their name) is then applied to the reshaped data:
results %>%
pivot_longer(c(home_team,away_team),names_to = "homeaway", values_to = "team") %>%
mutate(points=ifelse(grepl("home",homeaway),home_points,away_points),
goals=ifelse(grepl("home",homeaway),home_score,away_score),
receivedGoals=ifelse(grepl("home",homeaway),away_score,home_score)) %>%
select(date,tournament,country,team,points,goals,receivedGoals) -> results
results %>%
filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
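To see what the reshaping does, the sketch below runs the same kind of `pivot_longer()` step on a single made-up match, which becomes two rows, one per team. The data frame `toy` and the helper column `is_home` are hypothetical and exist only for this illustration:

```r
library(dplyr)
library(tidyr)

# One made-up match that already carries the points columns
toy <- data.frame(
  date = "2000-01-01",
  home_team = "A", away_team = "B",
  home_score = 2, away_score = 1,
  home_points = 3, away_points = 0
)

long <- toy %>%
  # Stack home_team and away_team into a single `team` column
  pivot_longer(c(home_team, away_team),
               names_to = "homeaway", values_to = "team") %>%
  mutate(
    is_home       = homeaway == "home_team",
    # Pick the value that belongs to the team on this row
    points        = ifelse(is_home, home_points, away_points),
    goals         = ifelse(is_home, home_score, away_score),
    receivedGoals = ifelse(is_home, away_score, home_score)
  ) %>%
  select(date, team, points, goals, receivedGoals)

long
```

Team A ends up with 3 points, 2 goals scored and 1 received, while team B gets the mirror image: 0 points, 1 goal scored and 2 received.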
The most interesting matches occur at the FIFA World Cup, so we can focus on what happens in this tournament:
# All World Cup matches, qualifiers included
worldCupResults %>%
mutate(yr=year(date)) %>%
group_by(yr,team) %>%
summarise(p=sum(points), goals=sum(goals), against=sum(receivedGoals), matches=n()) %>%
# Per-match averages of points, goals scored and goals received
mutate(performance=p/matches, offensive=goals/matches, defense=against/matches) %>%
ggplot(aes(x=yr, y=performance, fill=team)) +
geom_bar(stat="identity") -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
Germany emerges as the best performer across all the matches related to the World Cup. That is no surprise at all: remember all the “goleadas” (thrashings) it has produced, in the qualifiers as well as in the knockout matches in the final stages of the tournament.
Now we can look at what happens if we focus only on the final stage, that is, filtering out the qualifiers, for a handful of teams:
# Final-stage matches only: drop every tournament whose name mentions qualification
worldCupResults %>%
filter(!grepl("qualifi",tournament)) %>%
mutate(yr=year(date)) %>%
group_by(yr,team) %>%
summarise(p=sum(points), goals=sum(goals), against=sum(receivedGoals), matches=n()) %>%
# Per-match averages of points, goals scored and goals received
mutate(performance=p/matches, offensive=goals/matches, defense=against/matches) %>%
filter(team %in% c("Mexico","Brazil","Argentina","Germany","France")) %>%
ggplot(aes(x=yr, y=performance, color=team)) +
geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
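The performance metric plotted above is simply points per match, so it ranges from 0 (every match lost) to 3 (every match won). As a worked check of the arithmetic, here is a minimal sketch on made-up long-format rows. The teams and numbers are hypothetical, and the grouping is by team only to keep the example small:

```r
library(dplyr)

# Made-up long-format rows: one row per team per match
toy <- data.frame(
  team          = c("A", "A", "A", "B", "B"),
  points        = c(3, 1, 0, 3, 3),
  goals         = c(2, 1, 0, 4, 1),
  receivedGoals = c(0, 1, 2, 1, 0)
)

perf <- toy %>%
  group_by(team) %>%
  summarise(p = sum(points), goals = sum(goals),
            against = sum(receivedGoals), matches = n()) %>%
  # Divide each total by the number of matches played
  mutate(performance = p / matches,
         offensive   = goals / matches,
         defense     = against / matches)

perf
```

Team A collects 4 points in 3 matches (about 1.33 per match), while team B collects 6 points in 2 matches (3 per match, the maximum).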